This study analyses the following three Mycobacterium smegmatis samples:

Alexandre Stella and Hung Le prepared 1 mg of digested peptides (urea lysis followed by in-solution digestion) and subjected 500 µg to TiO2 enrichment. The samples before and after phospho-enrichment were injected onto the Q-Exactive Plus in technical duplicate. We work with 3 biological replicates (independent transfections).

Proteome

Data set inspection

The raw files of the proteomic data set are in RAW/Proteome_MQ041217/. We work with the MaxQuant table proteinGroups.txt.

I remove the “REV_” and “CON_” entries, and I keep only the proteins identified/quantified with at least 2 unique peptides.
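The filtering step can be sketched in R as follows. This is a minimal illustration on a toy table, not the original code: the column names `Protein.IDs` and `Unique.peptides` are assumed to be what `read.delim()` produces from the MaxQuant headers, the protein IDs are invented, and the helper `filter_protein_groups` is hypothetical.

```r
# Toy stand-in for proteinGroups.txt (IDs and column names are assumptions).
pg <- data.frame(
  Protein.IDs     = c("PROT1", "REV_PROT2", "CON_PROT3", "PROT4", "PROT5"),
  Unique.peptides = c(5, 3, 8, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop reverse hits and contaminants, then require
# a minimum number of unique peptides.
filter_protein_groups <- function(tab, min_unique = 2) {
  is_decoy_or_contaminant <- grepl("REV_|CON_", tab$Protein.IDs)
  keep <- !is_decoy_or_contaminant & tab$Unique.peptides >= min_unique
  tab[keep, , drop = FALSE]
}

pg_filtered <- filter_protein_groups(pg)
cat("There are", nrow(pg_filtered), "proteins after filtering\n")
```

Keeping the threshold as an argument makes it easy to check how the protein count changes with a stricter cut-off.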

There are 3563 proteins identified in the study.

Retrieve the gene names from UniProt to correct the trimmed “Fasta header” values

I export the protein IDs for manual retrieval using https://www.uniprot.org/uploadlists/ (20180117).

We searched the data against 2 proteome annotations from UniProt (strain ATCC 700084 / mc(2)155, UniProt IDs UP000000757 and UP000006158, with 6602 and 6585 entries, respectively). I choose a single ID per protein so that the proteomic data can be matched to the phospho-proteomic data later on.

The new column “Proteome” records which proteome each ID comes from. The “FALSE” entries are the “Pknbtub” sequences that we added and a “Biognosys7” entry.

Var1                                                 Freq
(empty)                                                44
FALSE                                                   2
UP000000757: Chromosome                               583
UP000000757: Chromosome; UP000006158: Chromosome      475
UP000001584: Chromosome                                 1
UP000006158: Chromosome                              2458

To simplify the table, I keep only one name per protein where possible. I keep the gene name by preference. If none is available, I keep the “MSMEG” ID, and if there is none I keep the “MSMEI” ID.

I manually check that this name assignment introduces no ID ambiguity.
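The priority rule above (gene name, then “MSMEG”, then “MSMEI”) could be written as a small R function. This is only a sketch: it assumes the UniProt gene names arrive as one space-separated string per protein, the example strings are invented, and `pick_gene_name` is a hypothetical helper.

```r
# Hypothetical helper: pick one name from a space-separated "Gene names"
# string, preferring a gene symbol, then an MSMEG ID, then an MSMEI ID.
pick_gene_name <- function(gene_names) {
  tokens <- strsplit(trimws(gene_names), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  is_msmeg <- grepl("^MSMEG_", tokens)
  is_msmei <- grepl("^MSMEI_", tokens)
  symbols <- tokens[!is_msmeg & !is_msmei]
  if (length(symbols) > 0) return(symbols[1])
  if (any(is_msmeg)) return(tokens[is_msmeg][1])
  if (any(is_msmei)) return(tokens[is_msmei][1])
  NA_character_  # nothing usable: leave for manual curation
}

# Invented example strings covering the four cases.
examples <- c("abcD MSMEG_0001 MSMEI_0001",
              "MSMEG_0002 MSMEI_0002",
              "MSMEI_0003",
              "")
picked <- vapply(examples, pick_gene_name, character(1), USE.NAMES = FALSE)
print(picked)
```

Entries that return `NA` are the ones that would still need the manual check.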

Normalisation

I use the LFQ values for the quantification.

In the following boxplot:

  • the red points are the quantification values for PknB of M. smegmatis.
  • the blue points are the quantification values for PknB of M. tuberculosis.
  • the green points are the quantification values for the mutant PknB of M. tuberculosis.

Plot of the PknB quantification values

Multivariate analysis

Reproducibility of the MS runs:

I save the protein table as OutputTables/NormIntProt_20190127.txt.

Statistical analysis

Calculate the mean of the technical repeats:
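A minimal sketch of averaging the technical duplicates in R. The `<sample>_<injection>` column naming and the helper `mean_technical` are assumptions made for the illustration; the values are toy log2 intensities.

```r
# Toy log2 LFQ matrix: 2 proteins, 2 biological replicates x 2 injections.
lfq <- matrix(
  c(20, 22, 21, NA,
    25, 25, 24, 26),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("protA", "protB"), c("K1_a", "K1_b", "K2_a", "K2_b"))
)

# Hypothetical helper: average the columns that share a sample prefix,
# ignoring missing injections.
mean_technical <- function(mat) {
  sample_id <- sub("_[^_]+$", "", colnames(mat))
  out <- sapply(unique(sample_id), function(s) {
    rowMeans(mat[, sample_id == s, drop = FALSE], na.rm = TRUE)
  })
  out[is.nan(out)] <- NA  # both injections missing -> keep as missing
  out
}

bio <- mean_technical(lfq)
print(bio)
```

With `na.rm = TRUE`, a protein seen in only one of the two injections keeps that single value instead of becoming missing.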

Figure: PknB levels in the different samples

Verification that the mutation is as expected: I plot the mean signal of the peptides containing the mutation (sequence “”) that are detected by MS/MS.

Replacement of missing values

I replace missing values only when a protein has one or no value detected across the 3 experiments of a condition (the conditions being L, K and P).

The replacement value is the 1% quantile of the corresponding condition.
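The selective replacement can be sketched as below. The data are simulated, the condition is assumed to be encoded as the first letter of each column name, and `impute_low_quantile` is a hypothetical helper, so this shows the idea rather than the code used for the report.

```r
set.seed(1)
# Simulated log2 matrix: 10 proteins x (3 L + 3 K) biological replicates.
x <- matrix(rnorm(60, mean = 25, sd = 2), nrow = 10,
            dimnames = list(paste0("prot", 1:10),
                            c("L1", "L2", "L3", "K1", "K2", "K3")))
x["prot1", c("L1", "L2", "L3")] <- NA   # no value in L   -> replaced
x["prot2", c("K1", "K2")]       <- NA   # one value in K  -> NAs replaced
x["prot3", "L1"]                <- NA   # two values in L -> left missing

# Hypothetical helper: within each condition, fill the NAs of rows with at
# most `max_observed` values using that condition's low quantile.
impute_low_quantile <- function(mat, condition, prob = 0.01, max_observed = 1) {
  for (cond in unique(condition)) {
    cols <- which(condition == cond)
    fill <- quantile(mat[, cols], probs = prob, na.rm = TRUE, names = FALSE)
    n_obs <- rowSums(!is.na(mat[, cols, drop = FALSE]))
    sparse <- n_obs <= max_observed
    if (!any(sparse)) next
    block <- mat[sparse, cols, drop = FALSE]
    block[is.na(block)] <- fill
    mat[sparse, cols] <- block
  }
  mat
}

condition <- substr(colnames(x), 1, 1)
x_imputed <- impute_low_quantile(x, condition)
```

Rows with two observed values keep their single gap, so the later t-test runs on measured values only for those proteins.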

I perform a two-sided Welch t-test followed by a BH correction of the p-values:
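For illustration, a row-wise Welch test with Benjamini-Hochberg adjustment can be run in base R as follows. The data are simulated and the two-condition layout (K versus L, 3 replicates each) is an assumption for the example.

```r
set.seed(42)
# Simulated log2 data: 50 proteins, 3 replicates in each of two conditions.
k <- matrix(rnorm(150, mean = 25), nrow = 50)
l <- matrix(rnorm(150, mean = 25), nrow = 50)
k[1:5, ] <- k[1:5, ] + 4   # spike in a shift for the first five proteins

# t.test() with var.equal = FALSE (the default) is the Welch test.
p_raw <- vapply(seq_len(nrow(k)), function(i) {
  t.test(k[i, ], l[i, ], alternative = "two.sided", var.equal = FALSE)$p.value
}, numeric(1))

p_bh    <- p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg adjustment
log2_fc <- rowMeans(k) - rowMeans(l)       # difference of the log2 means

res <- data.frame(log2_fc, p_raw, p_bh)
print(head(res))
```

The fold change and adjusted p-value columns are the two axes typically used for a volcano plot.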

Volcano plots

Normalisation factor for phosphoproteomics data set

I calculate a normalisation factor for each protein in each run. The vertical red line indicates the median intensity of PknB.


Phospho-enriched samples

Quality check

I remove the CON_ and REV_ entries.

There are 3798 phosphorylation sites identified in the study (from 1339 proteins).

I keep only the sites with a localisation probability of at least 75%:

After this filter, 2256 phosphorylation sites remain (from 1175 proteins).
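A short sketch of the localisation filter and of the site and protein counts. The table is a toy example and the column names `Protein` and `Localization.prob` are assumptions about how the MaxQuant site table is read into R.

```r
# Toy phospho-site table (values invented for the illustration).
sites <- data.frame(
  Protein           = c("P1", "P1", "P2", "P3", "P3"),
  Position          = c(12, 45, 7, 101, 230),
  Localization.prob = c(0.99, 0.60, 0.75, 0.74, 0.88),
  stringsAsFactors  = FALSE
)

# Keep sites localised with a probability of 0.75 or higher.
confident  <- sites[sites$Localization.prob >= 0.75, ]
n_sites    <- nrow(confident)
n_proteins <- length(unique(confident$Protein))
cat("There are", n_sites, "sites from", n_proteins, "proteins\n")
```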

Retrieve the gene names from UniProt to correct the trimmed “Fasta header” values

I export the protein IDs for manual retrieval using https://www.uniprot.org/uploadlists/ (20180117).

We searched the data against 2 proteome annotations from UniProt (see document ProteomesComparison). I choose a single ID per protein so that the proteomic data can be matched to the phospho-proteomic data later on.

Normalisation based on the iRTs

Normalisation to the protein changes

Each protein is normalised with a normalisation factor calculated in the proteomics data set.
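The report does not spell out how the factor is defined, so the following is only one plausible reading, written as a sketch: on the log2 scale, the factor of a protein in a run is taken as its deviation from its own mean across runs, and that deviation is subtracted from every site of the protein. The run names, site names and values are invented.

```r
runs <- c("K1", "L1", "P1")
# Toy log2 protein intensities (proteome) and log2 site intensities (phospho).
prot <- matrix(c(20, 21, 22,
                 30, 30, 30), nrow = 2, byrow = TRUE,
               dimnames = list(c("protA", "protB"), runs))
site <- matrix(c(15, 17, 19,
                 10, 12, 11), nrow = 2, byrow = TRUE,
               dimnames = list(c("protA_S10", "protB_T5"), runs))
site_protein <- c("protA", "protB")  # parent protein of each site

# Assumed definition: deviation of each protein from its mean across runs.
norm_factor <- prot - rowMeans(prot, na.rm = TRUE)

# Subtract the parent protein's factor from each site (log scale).
site_norm <- site - norm_factor[site_protein, , drop = FALSE]
print(site_norm)
```

In this toy case the site on protA loses the trend that merely follows the protein level, while the site on protB is unchanged because its protein is constant.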

I remove the CON_ and REV_ entries from the phospho table.

There are 2256 rows in the phospho table.

I carefully match the protein IDs from the phospho data set to those of the proteome.

We match 94.33% of the sites to the corresponding protein value in the proteome.
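The matching and the reported percentage could be computed along these lines. The IDs are invented, the assumption is that a phospho entry may list several IDs separated by “;”, and `match_to_proteome` is a hypothetical helper.

```r
# Toy IDs (invented for the illustration).
proteome_ids <- c("P1", "P2", "P3", "P4")
phospho_ids  <- c("P1", "P2;P9", "P7", "P3", "P8;P4")

# Hypothetical helper: for each ID group, return the first ID that is
# present in the proteome table, or NA when none is.
match_to_proteome <- function(id_groups, reference) {
  vapply(strsplit(id_groups, ";", fixed = TRUE), function(ids) {
    hit <- ids[ids %in% reference]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

matched     <- match_to_proteome(phospho_ids, proteome_ids)
pct_matched <- round(100 * mean(!is.na(matched)), 2)
cat("We match ", pct_matched, "% of the sites\n", sep = "")
```

Sites left as `NA` have no protein-level value and therefore cannot receive a protein normalisation factor.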

Calculation of the values per biological repeat

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     6.2088 3.4700 0.99933 0.90857 0.63822 0.56800
## Proportion of Variance 0.7176 0.2241 0.01859 0.01537 0.00758 0.00601
## Cumulative Proportion  0.7176 0.9417 0.96028 0.97564 0.98323 0.98923
##                            PC7    PC8     PC9
## Standard deviation     0.52518 0.4334 0.33884
## Proportion of Variance 0.00513 0.0035 0.00214
## Cumulative Proportion  0.99437 0.9979 1.00000
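A summary of this form is what `summary()` returns for a `prcomp()` object. The sketch below reproduces the layout on simulated data with 9 samples; whether the original analysis centred or scaled the variables is not stated, so the `center` and `scale.` settings here are assumptions.

```r
set.seed(7)
# Simulated log2 matrix: 200 sites x 9 samples (3 conditions x 3 replicates).
samples <- paste0(rep(c("K", "L", "P"), each = 3), 1:3)
m <- matrix(rnorm(200 * 9), nrow = 200, dimnames = list(NULL, samples))
m[, 1:3] <- m[, 1:3] + rnorm(200, sd = 3)  # shared shift for the K samples

# Samples as rows; prcomp() needs rows without missing values.
pca <- prcomp(t(m[complete.cases(m), ]), center = TRUE, scale. = FALSE)
importance <- summary(pca)$importance
print(round(importance, 4))
```

The rows of `importance` are the standard deviation, the proportion of variance and the cumulative proportion for each component.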

I add the information in the output table.

Replacement of missing values

I replace missing values selectively, only when a site has one or no measurement in a given condition (K, L, P).

The replacement value is the 1% quantile computed over all the conditions together, to avoid a bias due to the general increase in intensity when the kinase is over-expressed.
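A minimal sketch of this variant, where the fill value is pooled over every condition while the decision to replace is still taken per condition. The data are simulated and the column naming (condition letter followed by replicate number) is an assumption.

```r
set.seed(3)
cond <- rep(c("K", "L", "P"), each = 3)
y_raw <- matrix(rnorm(90, mean = 22, sd = 2), nrow = 10,
                dimnames = list(paste0("site", 1:10), paste0(cond, 1:3)))
y_raw["site1", cond == "L"]   <- NA   # zero measurements in L
y_raw["site2", c("K1", "K2")] <- NA   # one measurement in K

# One fill value for the whole table: the 1% quantile over all conditions.
fill <- quantile(y_raw, probs = 0.01, na.rm = TRUE, names = FALSE)

y_imputed <- y_raw
for (g in unique(cond)) {
  cols   <- which(cond == g)
  sparse <- rowSums(!is.na(y_imputed[, cols, drop = FALSE])) <= 1
  if (!any(sparse)) next
  block <- y_imputed[sparse, cols, drop = FALSE]
  block[is.na(block)] <- fill
  y_imputed[sparse, cols] <- block
}
```

Because `fill` is computed once, a condition with globally higher intensities does not receive a higher replacement value than the others.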

t-test

Two-sided Welch t-test followed by BH correction of the p-values.

Volcano plots

## 
## FALSE  TRUE 
##  2254     2
## 
## FALSE  TRUE 
##   871  1385
## 
## FALSE  TRUE 
##   870  1386

Create phosID

Volcano with sites of interest:

Volcano for the paper (with the known substrates of PknB):

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggrepel_0.8.0  gplots_3.0.1   knitr_1.21     corrplot_0.84 
## [5] reshape2_1.4.3 ggplot2_3.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0         RColorBrewer_1.1-2 pillar_1.3.1      
##  [4] compiler_3.5.2     plyr_1.8.4         highr_0.7         
##  [7] bitops_1.0-6       tools_3.5.2        digest_0.6.18     
## [10] evaluate_0.12      tibble_2.0.1       gtable_0.2.0      
## [13] pkgconfig_2.0.2    rlang_0.3.1        yaml_2.2.0        
## [16] xfun_0.4           withr_2.1.2        stringr_1.3.1     
## [19] gtools_3.8.1       caTools_1.17.1.1   grid_3.5.2        
## [22] rmarkdown_1.11     gdata_2.18.0       magrittr_1.5      
## [25] scales_1.0.0       htmltools_0.3.6    colorspace_1.4-0  
## [28] labeling_0.3       KernSmooth_2.23-15 stringi_1.2.4     
## [31] lazyeval_0.2.1     munsell_0.5.0      crayon_1.3.4